In this assignment, you will apply text mining techniques to evaluate tweets posted before, during, and after the landfall of Hurricane Harvey in the U.S. Hurricane Harvey made landfall as a category 4 storm in Texas on August 26, 2017. Classified as a major hurricane, Harvey brought massive amounts of rain to southeast Texas, resulting in widespread and devastating flooding. Some tweets can provide useful, real-time information during natural disasters, but not all tweets provide helpful information.
Image: Hurricane Harvey’s track. The points correspond to different hurricane categories, showing that the storm was strongest prior to making landfall near Houston. Source: SSHWS/NWS.
In this assignment, you will analyze Twitter data tagged with “Hurricane Harvey” using text mining and sentiment analysis methods. The main motivation for the analysis is to answer the questions:
When did Harvey-related tweets peak in relation to when the hurricane made landfall?
What are the 20 most commonly used words in the Hurricane Harvey tweets?
What are common words used in tweets that reference refineries?
How did the average sentiment of tweets change from August 17-29, 2017?
Before starting, you should:

- Open your Module 12 RStudio Project (.Rproj)
- Create a new R Markdown file (.Rmd) for this homework assignment
- Download the data from Moodle, and save the data to a data subdirectory in your working directory
- Load the following packages: tidyverse, tidytext, wordcloud, textdata, lubridate
Let’s get started! 🚀
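The package-loading step above can be written as a setup chunk at the top of your .Rmd:

```r
# Load the packages used throughout the assignment
library(tidyverse)  # data wrangling, forcats, and ggplot2
library(tidytext)   # tokenization and the stop_words dataset
library(wordcloud)  # word cloud plots
library(textdata)   # sentiment lexicons (e.g., afinn)
library(lubridate)  # date-time handling
```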
Data analyzed in this assignment were accessed under the CC0 Public Domain license from Kaggle’s Hurricane Harvey Tweets Dataset, which was put together and shared by @dan195. The tweets were posted from August 17-29, 2017. Note: The data you will analyze are real, and may include offensive language or ideas. I (Dr. Nelson) do not endorse any information contained in the dataset. The data are in a file named hurricane_harvey_tweets.csv. Each row in the dataset corresponds to one tweet that was posted. The data include the following columns:
- id: Unique tweet identification number, <dbl>
- datetime: Date and time when tweet was posted, <dttm>
- date: Date when tweet was posted, <date>
- likes: Number of likes the tweet received, <dbl>
- replies: Number of replies the tweet received, <dbl>
- retweets: Number of retweets made of the tweet, <dbl>
- tweet: A tweet that includes mention of Hurricane Harvey, <chr>
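Reading the data in could look like the sketch below; the object name `harvey_tweets` is my choice, and the path assumes the data subdirectory described in the setup steps:

```r
# Read the tweets from the "data" subdirectory of the working directory
harvey_tweets <- read_csv("data/hurricane_harvey_tweets.csv")

# Check that the column names and types match the data description
glimpse(harvey_tweets)
```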
Perform a word count analysis of all tweets in the dataset to determine the top 20 most commonly used words. To perform this analysis, you will have to:

- Tokenize the data so that you create a new tibble with “one-token-per-document-per-row” format
- Remove stop words. Remove the stop words included in data(stop_words), as well as the following custom stop words: c("hurricane", "harvey", "hurricaneharvey", "http", "https", "html", "ift.tt", "pic.twitter.com", "twitter.com", "fb.me", "bit.ly", "dlvr.it", "youtube", "youtu.be"). Refer to chapter 1 in the “Text Mining with R” book for an example of how to include custom stop words.
- Count the frequency of each word using count(word), and select only the top 20 most frequently occurring words (hint: use top_n())
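The steps above can be sketched as one pipeline. This is one possible approach, not the required solution; `harvey_tweets`, `custom_stop_words`, and `top_words` are assumed object names:

```r
# Custom stop words from the assignment, stored as a tibble so they
# can be removed with anti_join(), as in Text Mining with R, ch. 1
custom_stop_words <- tibble(
  word = c("hurricane", "harvey", "hurricaneharvey", "http", "https",
           "html", "ift.tt", "pic.twitter.com", "twitter.com", "fb.me",
           "bit.ly", "dlvr.it", "youtube", "youtu.be")
)

data(stop_words)  # tidytext's standard stop word list

top_words <- harvey_tweets %>%
  unnest_tokens(word, tweet) %>%                 # one token per row
  anti_join(stop_words, by = "word") %>%         # remove standard stop words
  anti_join(custom_stop_words, by = "word") %>%  # remove custom stop words
  count(word, sort = TRUE) %>%                   # frequency of each word
  top_n(20, n)                                   # keep the 20 most frequent
```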
Visualize the frequency (n) of the top 20 words. To create the plot shown below, you will have to include coord_flip() in your ggplot code. To reorder the words, you can use fct_reorder() (refer to the module on factors) or reorder().
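A minimal sketch of the plot, assuming a `top_words` tibble with `word` and `n` columns from the word count step:

```r
ggplot(top_words, aes(x = fct_reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +              # horizontal bars, as in the figure
  labs(x = NULL, y = "n")
```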
A large number of oil refineries are located in the area around Houston, which was directly impacted by Harvey. Were tweets that referenced refineries primarily focused on potential economic or environmental consequences? To answer this question, do the following:
- Starting with the original data (not the tokenized data from Part 2), filter the data so you only keep rows with tweets containing the word “refinery” or “refineries”
- Tokenize and remove stop words (stop_words and the custom stop words from Part 2)
- Create the word cloud. When creating the word cloud, specify that max.words = 100.
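One way to sketch these steps, assuming the `harvey_tweets` tibble and the `custom_stop_words` tibble from your Part 2 code:

```r
refinery_words <- harvey_tweets %>%
  # keep only tweets mentioning refineries (case-insensitive)
  filter(str_detect(tweet,
                    regex("refinery|refineries", ignore_case = TRUE))) %>%
  unnest_tokens(word, tweet) %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Word cloud of at most 100 words, sized by frequency
wordcloud(words = refinery_words$word,
          freq  = refinery_words$n,
          max.words = 100)
```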
Write 2-3 sentences in your R Markdown file describing whether tweets that referenced refineries seemed to emphasize the potential economic or environmental impacts.
Using the afinn sentiment lexicon, determine the average sentiment of tweets on each date in the dataset and create the figure shown below. To create this figure you will need to:

- Join the tokenized dataset from Part 2 with the afinn sentiment lexicon
- Summarize the average sentiment value for each date
- Visualize the average sentiment for each day. To adjust the x-axis as shown in the figure, use the following scale function: scale_x_date(date_breaks = "day", date_labels = "%d")
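These steps can be sketched as follows; `harvey_tweets` and `daily_sentiment` are assumed names, and the afinn lexicon's score column is named `value` in tidytext:

```r
daily_sentiment <- harvey_tweets %>%
  unnest_tokens(word, tweet) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%  # attach afinn scores
  group_by(date) %>%
  summarize(avg_sentiment = mean(value))                # average per date

ggplot(daily_sentiment, aes(x = date, y = avg_sentiment)) +
  geom_line() +
  scale_x_date(date_breaks = "day", date_labels = "%d") +
  labs(x = "Date (August 2017)", y = "Average sentiment")
```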
Submit the following files (2 total) to Moodle:
This assignment will be graded with the following rubric:
[2 pts] The R Markdown report is organized (1.5 pt) and has a descriptive header explaining what the report includes (0.5 pt); the descriptive header can be copied and pasted from the assignment. Basically, this text should summarize what the report includes so when you refer to it in the future, you’ll understand what the purpose of the assignment was.
[8 pts] The plots are perfectly recreated from parts 1, 2, and 4, and a word cloud is produced for part 3 (2 pts/plot)